Abstract

Semisupervised methods inevitably invoke 
some assumption that links the marginal dis- 
tribution of the features to the regression func- 
tion of the label. Most commonly, the clus- 
ter or manifold assumptions are used which 
imply that the regression function is smooth 
over high-density clusters or manifolds sup- 
porting the data. A generalization of these 
assumptions is that the regression function is 
smooth with respect to some density sensitive 
distance. This motivates the use of a density-based
metric [Bousquet et al., 2004, Coifman and Lafon, 2006,
Sajama and Orlitsky, 2005] for semisupervised learning.
We analyze this setting and make the following
contributions: (a) we propose a semi-supervised
learner that uses a density-sensitive kernel and
show that it provides better performance than
any supervised learner if the density support
set has a small condition number, and (b) we
show that it is possible to adapt to the degree
of semi-supervisedness using a data-dependent
choice of a parameter that controls the sensitivity
of the distance metric to the density. This
ensures that the semisupervised learner never
performs worse than a supervised learner even
if the assumptions fail to hold.

1 Introduction

Semisupervised methods inevitably invoke some assumption
that links the marginal distribution $p(x)$ of the
features $X$ to the regression function $f(x) = E[Y | X = x]$
of the label $Y$. The most common assumption is
the cluster assumption, in which it is assumed that $f$ is
very smooth wherever $p$ exhibits clusters [Lafferty and
Wasserman, 2007, Rigollet, 2007, Seeger, 2000, Singh
et al., 2008a]. In the special case where the clusters
are manifolds, this is called the manifold assumption
[Belkin and Niyogi, 2004, Lafferty and Wasserman,
2007, Niyogi, 2008].



A generalization of the cluster and manifold assumptions 
is that the regression function is smooth with respect to 
some density-sensitive distance. Several recent papers
propose using a density-based metric or diffusion distance
for semisupervised learning [Bousquet et al., 2004,
Coifman and Lafon, 2006, Sajama and Orlitsky, 2005].



In this paper, we analyze semisupervised inference under 
this generalized assumption. 

Singh, Nowak and Zhu [2008a], Lafferty and Wasserman
[2007] and Nadler et al. [2009] have shown that the degree
to which unlabeled data improves performance is
very sensitive to the cluster and manifold assumptions.
In this paper, we introduce adaptive semisupervised inference.
We define a parameter $\alpha$ that controls the sensitivity
of the distance metric to the density, and hence
the strength of the semisupervised assumption. When
$\alpha = 0$ there is no semisupervised assumption, that is,
there is no link between $f$ and $p$. When $\alpha = \infty$ there
is a very strong semisupervised assumption. We use the
data to estimate $\alpha$ and hence we adapt to the appropriate
assumption linking $f$ and $p$.

This paper makes the following contributions - (a) we 
propose a semi-supervised learner that uses a density- 
sensitive kernel and show that it provides better perfor- 
mance than any supervised learner if the density support 
set has a small condition number and (b) we show that it 
is possible to adapt to the degree of semi-supervisedness 
using data-dependent choice of a parameter that controls 
sensitivity of the distance metric to the density. This 
ensures that the semisupervised learner never performs 
worse than a supervised learner even if the assumptions 
fail to hold. Preliminary simulations, to be reported in future
work, confirmed that our proposed estimator adapts
well to $\alpha$ and has good risk both when the semisupervised
smoothness holds and when it fails.

Related Work. There are a number of papers that dis- 
cuss conditions under which semisupervised methods 
can succeed or that discuss metrics that are useful for 



semisupervised methods. These include Bousquet et al.
[2004], Singh et al. [2008b], Nadler et al. [2009], Sajama
and Orlitsky [2005] and references therein. However,
to the best of our knowledge, there are no papers
that explicitly study adaptive methods that allow the data 
to choose the strength of the semisupervised assumption. 

Outline. This paper is organized as follows. In Section 2
we define a set of joint distributions $\mathcal{P}_{XY}(\alpha)$ indexed by
$\alpha$. In Section 3 we define a density-sensitive estimator
$\hat f_\alpha$ of $f$, assuming that $(p, f) \in \mathcal{P}_{XY}(\alpha)$. We find finite
sample bounds on the error of $\hat f_\alpha$ and we investigate the
dependence of this error on $\alpha$. In Section 4 we show that
cross-validation can be used to adapt to $\alpha$. We conclude
in Section 5.

2 Definitions 

We consider the collection of joint distributions
$\mathcal{P}_{XY}(\alpha) = \mathcal{P}_X \times \mathcal{P}_{Y|X}$ indexed by a density-sensitivity
parameter $\alpha$ as follows. $X, Y$ are random variables, $X$ is
supported on a compact domain $\mathcal{X} \subset \mathbb{R}^d$, and $Y$ is real-valued.
The marginal density $p(x) \in [\lambda_0, \Lambda_0]$ is bounded
over its support $\{x : p(x) > 0\}$, where $0 < \lambda_0, \Lambda_0 < \infty$.
Also, let the conditional density be $p(y|x)$ with variance
bounded by $\sigma^2$, and the conditional label mean or regression
function be $f(x) = E[Y | X = x]$, with $|f(x)| \le M$. We
say that $(p, f) \in \mathcal{P}_{XY}(\alpha)$ if these functions satisfy the
properties described below.

Before stating the properties of $f$ and $p$, we define a
distance metric with density sensitivity $\alpha$.

Density-sensitive distance: We consider the following
distance with density sensitivity $\alpha \in [0, \infty)$ between two
points $x_1, x_2 \in \mathcal{X}$ that is a modification of the definition
in Sajama and Orlitsky [2005]:

$$D_\alpha(x_1, x_2) = \inf_{\gamma \in \Gamma(x_1, x_2)} \int_0^{L(\gamma)} \frac{1}{p(\gamma(t))^\alpha}\, dt, \qquad (1)$$

where $\Gamma(x_1, x_2)$ is the set of all continuous finite curves
from $x_1$ to $x_2$ with unit speed everywhere and $L(\gamma)$ is the
length of curve $\gamma$ (i.e. $\gamma(L(\gamma)) = x_2$). Notice that large
$\alpha$ makes points connected by high-density paths closer,
and $\alpha = 0$ corresponds to Euclidean distance.
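As an illustration only, the following minimal Python sketch approximates the density-sensitive distance between sample points by a shortest path on a neighborhood graph. The graph discretization, the midpoint rule for the line integral, the function name `density_sensitive_distances`, and the `radius` parameter are all assumptions of this sketch and are not part of the construction analyzed in the paper.

```python
import heapq
import math


def density_sensitive_distances(points, density, alpha, radius, source):
    """Approximate D_alpha(points[source], points[j]) for every j.

    Assumption (not from the paper): admissible paths are chains of
    straight segments between sample points at Euclidean distance
    <= radius, and the integral of 1 / p^alpha along a segment is
    approximated by segment_length / density(midpoint)**alpha.
    """
    n = len(points)
    dist = [math.inf] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:  # Dijkstra's algorithm on the implicit neighborhood graph
        d_u, u = heapq.heappop(heap)
        if d_u > dist[u]:
            continue  # stale heap entry
        for v in range(n):
            if v == u:
                continue
            length = math.dist(points[u], points[v])
            if length > radius:
                continue
            mid = [(a + b) / 2.0 for a, b in zip(points[u], points[v])]
            p_mid = density(mid)
            if p_mid <= 0.0:
                continue  # segment leaves the support: not admissible
            cand = d_u + length / p_mid ** alpha
            if cand < dist[v]:
                dist[v] = cand
                heapq.heappush(heap, (cand, v))
    return dist
```

With `alpha = 0` the weights reduce to Euclidean segment lengths, which mirrors the remark that $\alpha = 0$ corresponds to Euclidean distance; points that cannot be reached through the support keep distance `math.inf`.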

Our first assumption is that the regression function $f$ is
smooth with respect to the density-sensitive distance:

A1) Semisupervised smoothness: The regression function
$f(x) = E[Y | X = x]$ is $\beta$-smooth with respect to the
density-sensitive distance $D_\alpha$, i.e. there exist constants
$C_1, \beta > 0$ such that for all $x_1, x_2 \in \mathcal{X}$

$$|f(x_1) - f(x_2)| \le C_1 \left[ D_\alpha(x_1, x_2) \right]^\beta.$$

In particular, if $\alpha = 0$ and $\beta = 1$, this corresponds to
Lipschitz smoothness.

Our second assumption is that the density function $p$ is
smooth with respect to Euclidean distance over the support
set. Recall that the support of $p$ is $S = \{x : p(x) > 0\}$.

A2) Density smoothness: The density function $p(x)$ is
Hölder $\eta$-smooth with respect to Euclidean distance if it
has $\lfloor \eta \rfloor$ derivatives and there exists a constant $C_2 > 0$
such that for all $x_1, x_2 \in S$

$$|p(x_1) - T^{\lfloor \eta \rfloor}_{x_2}(x_1)| \le C_2 \|x_1 - x_2\|^\eta,$$

where $\lfloor \eta \rfloor$ is the largest integer such that $\lfloor \eta \rfloor < \eta$, and
$T^{\lfloor \eta \rfloor}_{x_2}$ is the Taylor polynomial of degree $\lfloor \eta \rfloor$ around the
point $x_2$.

The condition number of a set $S$ with boundary $\partial S$ is the
largest real number $\tau > 0$ such that, if $d(x, \partial S) \le \tau$ then
$x$ has a unique projection onto the boundary of $S$. Here,
$d(x, \partial S) = \inf_{z \in \partial S} \|x - z\|_2$. When $\tau$ is large, $S$ cannot
be too thin, the boundaries of $S$ cannot be too curved
and $S$ cannot get too close to being self-intersecting. If
$S$ consists of more than one connected component, then
$\tau$ large also means that the connected components cannot
be too close to each other. Let $\tau_0$ denote the smallest
condition number of the support sets $S$ of all $p \in \mathcal{P}_X$.
We shall see that semisupervised inference outperforms
supervised inference when $\tau_0$ is small. Additionally, we
assume that $S$ has at most $K < \infty$ connected components.

In the supervised setting, we assume access to $n$ labeled
data $\mathcal{L} = \{X_i, Y_i\}_{i=1}^n$ drawn i.i.d. from $\mathcal{P}_{XY}(\alpha)$, and in
the semi-supervised setting, we assume access to $m$ additional
unlabeled data $\mathcal{U} = \{X_i\}_{i=1}^m$ drawn i.i.d. from
$\mathcal{P}_X$.

As usual, we write $a_n = O(b_n)$ if $|a_n/b_n|$ is bounded for
all large $n$. Similarly, $a_n = \Omega(b_n)$ if $|a_n/b_n|$ is bounded
away from 0 for all large $n$. We write $a_n \asymp b_n$ if
$a_n = O(b_n)$ and $a_n = \Omega(b_n)$.

3 Density- Sensitive Inference 

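Let $K(x)$ be a symmetric non-negative function and let
$K_h(x) = K(\|x\|/h)$. Let

$$\hat p_m(x) = \frac{1}{m} \sum_{i=1}^m \frac{1}{h_m^d} K_{h_m}(x - X_i) \qquad (2)$$

be the kernel density estimator of $p$ with bandwidth $h_m$,
based on the unlabeled data. Define the support set estimate
$\hat S = \{x : \hat p_m(x) > 0\}$ and the empirical boundary
region

$$\mathcal{R}_{\partial \hat S} = \left\{ x : \inf_{z \in \partial \hat S} \|x - z\|_2 \le 2\delta_m \right\},$$

where $\delta_m = 2 c_2 \sqrt{d}\, ((\log^2 m)/m)^{1/d}$ for some constant
$c_2 > 0$. Now define a plug-in estimate of the distance
as follows:

$$\hat D_{\alpha,m}(x_1, x_2) = \inf_{\gamma \in \hat\Gamma(x_1, x_2)} \int_0^{L(\gamma)} \frac{1}{\hat p_m(\gamma(t))^\alpha}\, dt,$$

where $\hat\Gamma(x_1, x_2) = \{\gamma \in \Gamma(x_1, x_2) : \forall t \in [0, L(\gamma)],\ \gamma(t) \in \hat S \setminus \mathcal{R}_{\partial \hat S}\}$,
and $\hat D_{\alpha,m}(x_1, x_2) = \infty$ if $\hat\Gamma(x_1, x_2) = \emptyset$.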
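The density estimate and the support set estimate can be sketched in a few lines of Python. This is a minimal illustration assuming the boxcar kernel $K(u) = I(|u| \le 1)$ used later in this section; the function names are hypothetical, and the kernel is deliberately left unnormalized, which is an assumption of the sketch rather than a statement about the estimator analyzed here.

```python
import math


def boxcar_kde(x, sample, h):
    """Kernel density estimate at x with K(u) = 1 if |u| <= 1 else 0.

    Computes (1 / (m * h**d)) * sum_i K(||x - X_i|| / h). The boxcar
    kernel is left unnormalized (an assumption of this sketch); divide
    by the volume of the unit ball if a proper density is needed.
    """
    m = len(sample)
    d = len(x)
    hits = sum(1 for xi in sample if math.dist(x, xi) <= h)
    return hits / (m * h ** d)


def in_support_estimate(x, sample, h):
    """True if x lies in the estimated support {x : p_hat(x) > 0}."""
    return boxcar_kde(x, sample, h) > 0.0
```

A point belongs to the estimated support exactly when at least one unlabeled sample point lies within distance $h$ of it, which is what makes the plug-in distance infinite between regions that no admissible path connects.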

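We consider the following semisupervised learner which
uses a kernel that is sensitive to the density. In the following
definitions we take, for simplicity, $K(x) = I(\|x\| \le 1)$.

Semisupervised kernel estimator:

$$\hat f_{h,\alpha}(x) = \frac{\sum_{i=1}^n Y_i\, K_h\big(\hat D_{\alpha,m}(x, X_i)\big)}{\sum_{i=1}^n K_h\big(\hat D_{\alpha,m}(x, X_i)\big)}. \qquad (3)$$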
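A minimal Python sketch of such a learner follows, assuming the estimator is a Nadaraya-Watson-style local average in which the plug-in density-sensitive distance replaces Euclidean distance and the kernel is the boxcar $I(|u| \le 1)$. The function name, the precomputed-distance interface, and the `default` fallback are assumptions made for illustration.

```python
def semisupervised_kernel_estimate(dists_to_labeled, labels, h, default=0.0):
    """Local average of labels over labeled points within distance h.

    dists_to_labeled[i] is an (estimated) density-sensitive distance
    from the query point to the i-th labeled point; float("inf") marks
    points that cannot be reached inside the estimated support. With
    the boxcar kernel the estimate is the mean label over
    {i : distance_i <= h}. Returning `default` when that set is empty
    is an assumption of this sketch.
    """
    selected = [y for d, y in zip(dists_to_labeled, labels) if d <= h]
    if not selected:
        return default
    return sum(selected) / len(selected)
```

Labeled points in a different connected component of the estimated support have infinite distance and therefore never contribute, regardless of how close they are in Euclidean distance.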



3.1 Performance upper bound for semisupervised 
estimator 

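The following theorem characterizes the performance of
the density-sensitive semisupervised kernel estimator.

Theorem 1. Assume $\lambda_0 > 1 + c_0$ for some constant $c_0 > 0$,¹
and let $\epsilon_m = c_1 (\log m)^{-1/2}$ for constant $c_1 > 0$ and
$\delta_m = 2 c_2 \sqrt{d}\, ((\log^2 m)/m)^{1/d}$ for some constant $c_2 > 0$.
If $\tau_0 \in (3\delta_m, \infty)$ and $h > 2c_4 / (\tau_0^{d-1} (\lambda_0 - \epsilon_m)^\alpha)$,
where $c_4 > 0$ is a constant, then for large enough $m$

$$\sup_{(p,f) \in \mathcal{P}_{XY}(\alpha)} E_{n,m} \int (\hat f_{h,\alpha}(x) - f(x))^2\, dP(x) \le (M^2 + \sigma^2) \left( \frac{1}{m} + 3 c_3 2^d \Lambda_0 \frac{\delta_m}{\tau_0} \right) + C_1^2 \left[ h \left( \frac{\Lambda_0 + \epsilon_m}{\lambda_0} \right)^\alpha \right]^{2\beta} + \frac{K (M^2/e + 2\sigma^2)}{n}.$$

¹This assumption is more restrictive than necessary, and a
more general statement can be made by introducing a rescaling factor
in the definition of the density-sensitive distance.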



The proof of Theorem 1 is given in Section 6. The first
term is negligible when the amount of unlabeled data $m$
is large. The second term is the bias and the third term is
the variance. If the bandwidth

1 



h 



Om An 



and $\alpha \asymp \log m$ is large enough, then the density-sensitive
semisupervised kernel estimator is able to achieve an integrated
MSE rate of $O(n^{-1})$ for all joint distributions
in $\mathcal{P}_{XY}(\alpha)$ supported on sets with condition number
$\tau_0 > 3\delta_m$.

3.2 Performance lower bound for any supervised 
estimator 

We now establish a lower bound on the performance of 
any supervised estimator. 

Theorem 2. Assume $d \ge 2$ and $\alpha > 0$. There exists
a constant $c_5 > 0$ depending only on $d$ so that if
$\tau_0 \le c_5 n^{-1/(d-1)}$, then

$$\inf_{\hat f} \sup_{(p,f) \in \mathcal{P}_{XY}(\alpha)} E_n \int (\hat f(x) - f(x))^2\, dP(x) = \Omega(1),$$

where the inf is over all supervised estimators.

Coupled with Theorem 1, the results state that if the
condition number of the support set is small,
$3\delta_m < \tau_0 \le c_5 n^{-1/(d-1)}$, and $\alpha$ is large enough, then the
density-sensitive semi-supervised estimator outperforms any
supervised learning algorithm in terms of integrated MSE
rate.

A complete proof of Theorem 2 is given in the appendix.
Here we provide some intuition regarding the proof strategy.
We construct a set of joint distributions over $X$ and
$Y$ that depends on $n$, and apply Assouad's Lemma. Intuitively,
we need to take advantage of the decreasing
condition number $\tau_0$. This is because if $\tau_0$ were to be
kept fixed, as $n$ increases the semi-supervised assumption
would reduce to familiar Euclidean smoothness.

So, we construct the distributions as follows. We split
the unit cube in $\mathbb{R}^d$ into two rectangular sets with a small
gap in between, and let the marginal density $p$ be uniform
over these sets. Then we add a series of "bumps"
between the two rectangles, as shown schematically in
Figure 1. Over one of the sets we set $f = M$, and over
the other we set $f = -M$. The number of bumps increases
with $n$, implying that the condition number must
decrease. The sets are designed specifically so that the
condition number can be lower bounded easily as a function
of $n$. In essence, as $n$ increases these boundaries
become space-filling, so that there is a region where the
regression function could be $M$ or $-M$, and it is not possible
to tell which with only labeled data.



[Figure: Support set of marginal density]

Figure 1: A two-dimensional cross-section of the support
of a marginal density $p$ used in the proof of Theorem 2.


4 Adaptive Semisupervised Inference 



In Section 3.1 we established a bound on the integrated
mean square error of the density-sensitive semisupervised
kernel estimator. The bound is achieved by using
an estimate $\hat D_\alpha$ of the density-sensitive distance. However,
this requires knowing the density-sensitivity parameter
$\alpha$, along with other parameters.

It is critical to choose $\alpha$ (and $h$) appropriately, otherwise
we might incur a large error if the semisupervised assumption
does not hold or holds with a different density
sensitivity value $\alpha$. The following result shows that we
can adapt to the correct degree of semisupervisedness $\alpha$
if cross-validation is used to select the appropriate $\alpha$ and
$h$. This implies that the estimator gracefully degrades
to a supervised learner if the semisupervised assumption
(sensitivity of the regression function to the marginal density)
does not hold ($\alpha = 0$).

For any $f$, define the risk $R(f) = E[(f(X) - Y)^2]$ and
the excess risk $\mathcal{E}(f) = R(f) - R(f^*) = E[(f(X) - f^*(X))^2]$,
where $f^*$ is the true regression function. Let
$\mathcal{H}$ be a finite set of bandwidths and let $\mathcal{A}$ be a finite
set of values for $\alpha$. Divide the data into training data
$T$ and validation data $V$. For notational simplicity, let
both sets have size $n$. Let $\mathcal{F} = \{\hat f^T_{\alpha,h}\}_{\alpha \in \mathcal{A}, h \in \mathcal{H}}$ denote
the semisupervised kernel estimators trained on data $T$
using $\alpha \in \mathcal{A}$ and $h \in \mathcal{H}$. For each $\hat f^T_{\alpha,h} \in \mathcal{F}$ let
$\hat R^V(\hat f^T_{\alpha,h}) = n^{-1} \sum_{i=1}^n (\hat f^T_{\alpha,h}(X_i) - Y_i)^2$, where the sum
is over $V$. Let $Y_i = f(X_i) + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$.
Also, we assume that $|f(x)|, |\hat f^T_{\alpha,h}(x)| \le M$, where
$M > 0$ is a constant.²

Theorem 3. Let $\mathcal{F} = \{\hat f^T_{\alpha,h}\}_{\alpha \in \mathcal{A}, h \in \mathcal{H}}$ denote the
semisupervised kernel estimators trained on data $T$ using
$\alpha \in \mathcal{A}$ and $h \in \mathcal{H}$. Use validation data $V$ to pick

$$(\hat\alpha, \hat h) = \arg\min_{\alpha \in \mathcal{A}, h \in \mathcal{H}} \hat R^V(\hat f^T_{\alpha,h})$$

and define the corresponding estimator $\hat f_{\hat\alpha, \hat h}$. Then, for
every $0 < \delta < 1$,

$$E[\mathcal{E}(\hat f_{\hat\alpha, \hat h})] \le \frac{1}{1 - a} \left[ \min_{\alpha \in \mathcal{A}, h \in \mathcal{H}} E[\mathcal{E}(\hat f_{\alpha,h})] + \frac{\log(|\mathcal{A}||\mathcal{H}|/\delta)}{nt} \right] + 4\delta M^2,$$

where $0 < a < 1$ and $0 < t < 15/(38(M^2 + \sigma^2))$ are
constants. $E$ denotes expectation over everything that is
random.

See appendix for proof. In practice, both $\mathcal{H}$ and $\mathcal{A}$
may be taken to be of size $n^a$ for some $a > 0$. Then
we can approximate the optimal $h$ and $\alpha$ with sufficient
accuracy to achieve the optimal rate. Setting $\delta = 1/(4M^2 n)$,
we then see that the penalty for adaptation is
$\frac{\log(|\mathcal{A}||\mathcal{H}|/\delta)}{nt} + 4\delta M^2 = O(\log n / n)$ and hence introduces
only a logarithmic term.
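The selection step can be illustrated with a short Python sketch. It assumes a hold-out split as described above; `fit` is a hypothetical callable that returns a trained predictor for a given pair $(\alpha, h)$, and `adaptation_penalty` simply evaluates the penalty expression as we read it from the bound, so both functions are illustrations under those assumptions rather than a reference implementation.

```python
import itertools
import math


def select_by_validation(alphas, bandwidths, fit, validation):
    """Pick (alpha, h) minimizing mean squared error on held-out data.

    `fit(alpha, h)` is a hypothetical callable returning a predictor
    x -> float trained on the training split; `validation` is a list
    of (x, y) pairs. Both names are assumptions of this sketch.
    """
    best_pair, best_risk = None, math.inf
    for alpha, h in itertools.product(alphas, bandwidths):
        predictor = fit(alpha, h)
        risk = sum((predictor(x) - y) ** 2 for x, y in validation) / len(validation)
        if risk < best_risk:
            best_pair, best_risk = (alpha, h), risk
    return best_pair, best_risk


def adaptation_penalty(n, num_alphas, num_bandwidths, delta, t, M):
    """Evaluate log(|A||H| / delta) / (n t) + 4 delta M^2."""
    return math.log(num_alphas * num_bandwidths / delta) / (n * t) + 4.0 * delta * M ** 2
```

Including $\alpha = 0$ in the grid lets the validation step fall back to an ordinary Euclidean kernel estimator whenever that choice has the smallest held-out error.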

5 Discussion 

Semisupervised methods are very powerful but, like all 
methods, they only work under certain conditions. 

We have shown that, when the support of the distribu- 
tion is somewhat irregular (i.e., the boundary of the sup- 
port of the density has a small condition number), then 
semi-supervised methods can attain better performance. 
Specifically, we demonstrated that a semi-supervised 
kernel estimator that uses a density-sensitive distance can 
outperform any supervised estimator in such cases. 

We introduced a family of estimators indexed by a parameter
$\alpha$. This parameter controls the strength of the
semi-supervised assumption. We showed that the behavior
of the semi-supervised method depends critically on
$\alpha$.

Finally, we showed that cross-validation can be used to
automatically adapt to $\alpha$ so that $\alpha$ does not need to be
known. Hence, our method takes advantage of the unlabeled
data when the semi-supervised assumption holds,
but does not add extra bias when the assumption fails.
Preliminary simulations confirm that our proposed estimator
adapts well to $\alpha$ and has good risk both when the
semi-supervised smoothness holds and when it fails. We
will report these results in future work.

²Note that the estimator can always be truncated if necessary.

The analysis in this paper can be extended in several
ways. First, it is possible to use other density-sensitive
metrics such as the diffusion distance [Lee and Wasserman,
2008]. Second, it is possible to relax the assumption
that the density $p$ is strictly bounded away from 0 on
its support. Finally, other estimators besides kernel estimators
can be used. We will report on these extensions
elsewhere.

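6 Proof of Theorem 1

Here we prove Theorem 1 stated in Section 3.1 (repeated
below for convenience), using some results given in the
appendix.

Theorem 4. Assume $\lambda_0 > 1 + c_0$ for some constant $c_0 > 0$,
and let $\epsilon_m = c_1 (\log m)^{-1/2}$ for constant $c_1 > 0$ and
$\delta_m = 2 c_2 \sqrt{d}\, ((\log^2 m)/m)^{1/d}$ for some constant $c_2 > 0$.
If $\tau_0 \in (3\delta_m, \infty)$ and $h > 2c_4 / (\tau_0^{d-1} (\lambda_0 - \epsilon_m)^\alpha)$,
where $c_4 > 0$ is a constant, then for large enough $m$

$$\sup_{(p,f) \in \mathcal{P}_{XY}(\alpha)} E_{n,m} \int (\hat f_{h,\alpha}(x) - f(x))^2\, dP(x) \le (M^2 + \sigma^2) \left( \frac{1}{m} + 3 c_3 2^d \Lambda_0 \frac{\delta_m}{\tau_0} \right) + C_1^2 \left[ h \left( \frac{\Lambda_0 + \epsilon_m}{\lambda_0} \right)^\alpha \right]^{2\beta} + \frac{K (M^2/e + 2\sigma^2)}{n}.$$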



Proof. Let $G_m$ be the indicator of the event when the unlabeled
sample is such that $\sup_{x \in S \setminus \mathcal{R}_{\partial S}} |p(x) - \hat p_m(x)| \le \epsilon_m$
and $\partial \hat S \subset \mathcal{R}_{\partial S}$. From Theorem 5,

$$E_{n,m}\left[ (1 - G_m) \int (\hat f_{h,\alpha}(x) - f(x))^2\, dP(x) \right] \le \frac{1}{m} (M^2 + \sigma^2).$$

We can write

$$\int (\hat f_{h,\alpha}(x) - f(x))^2\, dP(x) = \int_{S^*_m} (\hat f_{h,\alpha}(x) - f(x))^2\, dP(x) + \int_{S \setminus S^*_m} (\hat f_{h,\alpha}(x) - f(x))^2\, dP(x),$$

where $S^*_m$ is as defined in Proposition 2. For the boundary
region we have

$$E_{n,m} \int_{S \setminus S^*_m} (\hat f_{h,\alpha}(x) - f(x))^2\, dP(x) \le (M^2 + \sigma^2)\, P(S \setminus S^*_m) \le \Lambda_0 (M^2 + \sigma^2)\, \mathrm{Leb}(S \setminus S^*_m),$$

where Leb denotes the Lebesgue measure. Since the radius
of curvature of $\partial S$ is at least $\tau_0$, and $\tau_0 > 3\delta_m$, we
have by Proposition 3



Lch{S\S*J < Vol{dS) 



(tq + 3(5„,.) - 



<C3 



3 6 in 

Tq 



d\ 3S„ 



- 1 



1/ To 



< 3c32' 



To 



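where Vol denotes the $(d-1)$-dimensional volume on $\partial S$.
So

$$E_{n,m} \int_{S \setminus S^*_m} (\hat f_{h,\alpha}(x) - f(x))^2\, dP(x) \le 3 c_3 2^d \Lambda_0 (M^2 + \sigma^2) \frac{\delta_m}{\tau_0}.$$

Following the derivation in Chapter 5 of Györfi et al.
[2002], we have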



E„<(g™y ifhA^) - fix)ydp{x) 

<GmCl sup sup Da{x,x'y^ 



AP/e + 2a^ f ^ h 

n \ 2 



where 5'^,°'"' — {x' : Da,mix,x') < /i}, and A/" denotes 
the covering number Note that since r{x,x') — 
Da,m = 00, we will always have {x,x') E if x' E 
S n S'^/"'" (and, of course, the same applies when x' £ 
apply Proposition |2j to give 



Gm sup sup Da{x,x')'^'^ < \hi 
x<£S^ , c. c.Sc ™ L V 



■ x'esns. 



Ao + e?i 
An 



2f3 



and 

Gm■^f ^S'*,, Da,m, - ^mM (^S*^, ds^, ^ " ^ 

where the $d_{S^*_m}$ distance is the length of the shortest path
between two points restricted to $S^*_m$, as defined in the
appendix. Clearly $S^*_m$ has condition number at least
$\tau_0 - 3\delta_m > 0$. If $S^*_m$ has exactly one connected component,
then Proposition 4 combined with the assumption that
$h > 2c_4 / (\tau_0^{d-1} (\lambda_0 - \epsilon_m)^\alpha)$ implies that any point in
$S^*_m$ is a $h(\lambda_0 - \epsilon_m)^\alpha/2$ covering, so



= 1. 



Since $S^*_m$ can have at most $K$ connected components,
we can repeat the same argument for each component
and conclude that



< K. 



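So,

$$E_{n,m} \int (\hat f_{h,\alpha}(x) - f(x))^2\, dP(x) \le (M^2 + \sigma^2) \left( \frac{1}{m} + 3 c_3 2^d \Lambda_0 \frac{\delta_m}{\tau_0} \right) + C_1^2 \left[ h \left( \frac{\Lambda_0 + \epsilon_m}{\lambda_0} \right)^\alpha \right]^{2\beta} + \frac{K (M^2/e + 2\sigma^2)}{n}.$$

□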
Appendix

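Results used in proof of Theorem 1. In order to prove
Theorem 1 we characterize how the plug-in density-sensitive
distance estimate $\hat D_{\alpha,m}$ behaves. For this, we start
with a result about the density estimator.

Theorem 5. If $m \ge m_0$, where $m_0 = m_0(\lambda_0, \Lambda_0)$ is a
constant, then for all marginal densities $p$ of distributions
in $\mathcal{P}_{XY}(\alpha)$, we have with probability $> 1 - 1/m$,

$$\sup_{x \in S \setminus \mathcal{R}_{\partial S}} |p(x) - \hat p_m(x)| \le \epsilon_m \quad \text{and} \quad \partial \hat S \subset \mathcal{R}_{\partial S},$$

where $\epsilon_m = c_1 (\log m)^{-1/2}$ for constant $c_1 = c_1(K, C_2, d, \eta, \lambda_0)$,
$\hat S = \{x : \hat p_m(x) > 0\}$, and

$$\mathcal{R}_{\partial S} = \left\{ x : \inf_{z \in \partial S} \|x - z\|_2 \le \delta_m \right\},$$

where $\delta_m = 2 c_2 \sqrt{d}\, ((\log^2 m)/m)^{1/d}$ for some constant $c_2 > 0$.

Proof. Follows from Theorem 1 in Singh et al. [2008a]
by noting that since the density estimate will be 0 a.s.
outside the boundary region, and we have $p \ge \lambda_0$ on $S$,
for sufficiently large $m$ (i.e. small $\epsilon_m$), we must have
$S \setminus \mathcal{R}_{\partial S} \subseteq \hat S \subseteq S \cup \mathcal{R}_{\partial S}$. □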



must also be in the same connected component. For 

{xi,x2) e 



Da,mixi,X2) 



Lil) 



inf 



1 p(7(t))" 



; p(7W)"Fm(7W)" 




dt 



< Jnf 

7er(a;i,X2) 



i(7) 





^ I ^^(^) 

< sup 

zeS\Kas \Pm[z) 

and 

f P{z) 

sup 

z£S\nos \Pm[Z) 



Pi-fit))- 



-dt 



te[0,L(7)] \Prnilit)) 



Da,mixi,X2) 



< sup ( P^'^ 

~ zeS\TZas + 



< 



Aq 

(Ao — em)+ 



So 



Da,mixi,X2) < ( tt ^° , ) Da.mixi,X2). 



Similarly, 



The following two propositions now characterize how 
the plug-in density-sensitive distance estimate Da be- 
haves. 

Proposition!. Assume sup \pmix) ~ pix)\ < em 
xes\nos 

and dS C TZas- Let 



i(7) 

Da,mixi,X2)= juf / 

'rer{xi,x2) J mW)" 



dt 



and'^ = {ixi,X2) : 2:^1,2:2 € S\JZqs, r(a;i,a;2) 7^ 0}. 
Then for any (a; 1,2:2) € ^, 



An 



Aq + Eiri 



£'a,m(2;i,2;2) < I^q,™ (2:1 , 2:2 ) 



< 



An 



(Aq — em) + 



Da,m{xi,X2). 



Proof. Note that by the triangle inequality, TZos Q T^dS, 
so S\TZds ^ S\TZds since tq > 2Sm for m large 
enough. We see that if (a::i,a;2) G 5*, then x and y 
must be in the same connected component of S\TZds, 
and, furthermore, all points along any path in r(2;i, 2:2) 



Da,mixi,X2)> inf 



piz) 



<£S\1Zgs \piz) + er. 
-^0 \ 



Doi,m,{xi, X2) 



> I V^T^ ) Da,mixi,X2). 

, Aq + Cr; 



□ 



Given a set A C W^, define 



dAixi,X2)^ inf L(7) 
7erA (2:1,2:2) 

where ryi(a;i,a;2) = {76 r(xi,X2) : Vt G 
[0,i(7)] 7W e^}- 

Proposition 2. Wi'f/i f/ze notation of Proposition^ for all 

Xi,X2, 

Da,mixi,X2) < Da^mixi,X2). 

Assume sup \pmix) — pix)\ < em and dS C TZds- 

xes\nos 
Then for any ixi,X2) G 



£'a,m(2;i,2;2) < 



and 



A + e, 



where = i x ^ S : inf ||a; — z\\2 > 35„ 

zEdS 



Proof. Since for any xi and a;2, r(a::i, X2) C r(a;i,a;2), 
clearly 2:2) < DaM^i^^^)- If dS c TZas, 

write 



i(7) 



Da,ni{xi,X2) = Jnf 

7Gr(a:i,a:2) 



p(7(t))" 



< 



< 



sup 



p(z)" 



ini I dt 



< 



-^0 



(a;i,a;2) 



since, by the triangle inequality, $S^*_m \subseteq \hat S \setminus \mathcal{R}_{\partial \hat S}$. Applying
Proposition 1, the result follows. □

To prove Theorem 1 we also need the following two results.

Proposition 3. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$, and
$T > 0$. Then for any $\tau \in (0, T)$, for all sets $S \subseteq \mathcal{X}$ with
condition number at least $\tau$, $\mathrm{Vol}(\partial S) \le c_3/\tau$ for some
$c_3$ independent of $\tau$, where Vol is the $(d-1)$-dimensional
volume.

Proof. Let {zi}fLi be a minimal Euclidean r/2- 
covering of dS, and Bi — {x : \\x — Zi\\2 < t/2}. Let 
Ti be the tangent plane to dS at zi. Then using the argu- 



ment made in the proof of Lemma 4 in Genovese et al. 
I I20T0I , 



Vol(B, n dS) < Ci Vol(B, n Ti] 



1 



for some constants Ci and C2 independent of r. Since 
X is compact. 



Af{dS,\\-\\2,T/2)<C 



for some constant C depending only on X and T, where 
J\f denotes the covering number (note that even though 
dS isad — 1 dimensional set, we can't claim Af{dS, \\ ■ 



\\2,t) = 0{t ^'^ ^^), since 55" can become space-filling 
as T — > 0). So 



N 



Voi(as') < J2 Voi(s, n dS) 

<C2r'^-W(95,||-||2,r/2) 
< C2Ct-^ 

and the result follows with C3 = C2C. 



□ 



Proposition 4. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$, and
$T > 0$. Then for any $\tau \in (0, T)$, for all compact, connected
sets $S \subseteq \mathcal{X}$ with condition number at least $\tau$,
$\sup_{u,v \in S} d_S(u, v) \le c_4 \tau^{1-d}$ for some $c_4$ independent of $\tau$.



Proof. First consider the quantity 

sup ds{u,v). 

Since dS C S, clearly 

sup ds{u,v) < sup dQs{u,v). 

Since dS is closed, there must exist u*,v* e dS such 
that 

sup dosiu,v) ^ dds{u* ,v*). 

Let {zi}fLi be a minimal r-covering of dS in the dgs 
metric. Let {%}fLi C {zi}fLi such that dgs{u* < 
T, dQs{v* ,Zj;;j) < T, and for any 1 < i < — 1, 
d9s{zt,Zi+i) < 2r. Then 

dds{u*,v*) < dQs{u*,zi) + dQs{v*,Zj:^) 

N-l 

+ X] dos{zi,Zi+i) 

i=l 

< 2tN. 

So, 

d9siu\v*) < 2Tj^{dS,das,T). 
By Proposition 6.3 in Niyogi et al. 1 20081 (or see Lemma 



3 in Genovese et al.^[2010J ), if x,y d dS such that — 
yh = a < t/2, then dQs{x,y) < r - t^/I - (2a)/T. 
In particular, if II a; — y II 2 < r/2, then dds{x,y) < t. So 
any Euchdean T/2-covering of dS is also a r-covering 



in the dgs metric. Then we have 



We now prove Theorem |2] 



sup ds{u,v) < dgsiu* ,v*) 

< 2TM{dS,das,r) 

< 2TAf{dS,\\ ■ ||2,r/2) 

for some constant C depending only on X and T (note 
that, as in the proof of [3] even though dS is a d — 1 
dimensional set, we can't claim Af{dS,\\ ■ \\2,t) — 
(9(7--(<i-i))^ since dS can become space-filling as r — 
0). 

Now let , €E S such that 

sup ds{u,v) — ds{u^ ,v^) 

u,v£S 



Proof. Construction: 

Let I = [coni/('^"i)j with cq > 1 a constant, q = l*^^^, 
n = {0, 1}9 and e = iqL. For i e {1, I}, let a, = 

i±^. ForTe {l,...,?r-Meti;, = (a, , a._^_ J. 

Define g : R'^-^ ^ M as .g(5) = 

r + \J {j - rY - Ml for ||5||2<^-?- 

^ r-^Jr'-ik-mhY for ^ ~ r < < i 

o.w. 



which must exist since S is compact. Let , S dS 
be the (not necessarily unique) projections of and 
onto 95*. Clearly the line segment connecting and 
is fully contained in S, and the same applies to and 
w*. So 

ds(M^, W^) < d5(M^, U"'') + ^5(1*"'', + ds{v^,v^) 

< \\u^ — u^\\2 + Wv^ — v^\\2 + ds{u* ,v*) 

and setting C4 = 2T'^^^ diam(A')+C, the result follows. 

□ 

Proof of Theorem 2. The proof of Theorem 2 is based
on the following result based on Assouad's Lemma (see
e.g. Tsybakov [2009]).

Theorem 6. Let — {0, l}"^, the collection of binary 
vectors of length q > 1. Let Vq = {P'^,^^ G ^} be 
the corresponding collection of 2'^ probability measures 
associated with each vector. Also let || A P^\\ denote 
the affinity between two distributions (i.e. |jP" AP"|| = 
1 — sup \ P'^ {A) — P^ {A)\, where the supremum is over 

A 

all measurable sets), and p{-,-) denotes the Hamming 
distance between two binary vectors. For any semi- 
distance d 



inf ] 



iE.M'(r,r)]> 



X I min IIP'^ A P"^ 



for i e M'^ ^, where r E (0, 1/4), to be specified later. 

Let P = {{x,xa) G [-0.5,0.5]'^-'^ x [0,1] : < 
gix)}. For? e let P. = {(i,Xd) e 

R'^-'^ X M : ((£ - v:^)/e,Xd - 1/8) G P} and P- = 
{{i,Xd) G M'^-ixM : ((i-wj)/e,a:rf-(l/8+r)) e P}. 
Let S = {x e R"^ : 3x' = (i',x^) G [e, 1 - -i^^-i 



[e, o — e] s.t. Ilx — 



^ „„ „ „^ < e} and {x e M'' ; 3a;' = 

(5', a;;;) e [e, l-e]"^-^ x [|+r+e, 1-e] s.t. ||a;-a;'||2 < 

e}. For any T C {1, l^-^, let = 5 U ^ IJ 

and 5r = S'\ IJ P^ . Let T be an arbitrary order- 

\ier 7 

ing of Given lo £ ajet T{lu)_ = {f, : 

= 1}, and let 5"^ = S^^^^y S — S'r(aj), and 

Lctp^ix) = j^f^, f-{x) = M/5^(x) - A/V(a;), 
and p{y\x)'^ = 5{y — /"(a;)), where 6{-) is the Dirac 
delta (we could also use a conditional distribution that is 
absolutely continuous with respect to Lebesgue measure; 
the result would be the same). Finally, let P'^ denote the 
measure on M'^+^ defined by p"(a:) andp"(2/|a:), and P^ 
the corresponding product measure. 

Proof of f}(l) rate: 

Note that Leb(Pj) — Lcb(P:^), and so for any uj,uj', 
Lcb(S'") LebCS*"') = Lcb(5) + Lcb(S^). Let A = 
l/(Lcb(5) +Leb(5)),i.e. A = 1/ Leb(S'") for any cj. 

Let uj,uj' E fl such that p{uj,ll!') = 1 (where p denotes 
the hamming distance), and WLOG assume oji — and 
w- = 1. Also denote i = Ti. Then the LI distance 



between and P'^ is 



Also we have 



di(P",P" ) 

\p''{x)p''{y\x)-p''\x)p''\y\x)\dydx 

\\p"{y\x)-\p''\y\x)\dydx 

Ap" (y\x)dydx 



irix) - r\x)fp'^{x)dx ^ j {r{x) - r\x)fdx 

{r{x)~ r'{x)fdx 



{r{^)-r {x)fdx~ 

= d^{r,r') - M^Yol{S'^'\S'^) 
> d^ir, /"') - M^qe'^-^ Leh{B\B, 



d'{r,r)-M' 



Since A > 1, 



1 + 2 



Xp'^ {y\x)dydx + 

^ \Xp'^{y\x)~Xp'^'{y\x)\dydx 

= + A Leb(Pj\Pj) + A Leb(Pj\i?j) 

+ 2ALcb(BjnPj) 
= A(Leb(Pj) +Leb(P-)) 
= 2Ae'^-iLeb(P) 

where in the first step we have used the fact that x ^ 
5" U 5"' ^ p'^(x) = = 0, and divided 5" U 

5" into four non-intersecting components. Then we can 

bound the affinity of the product measures and Assume n > 2'*. Then I > 2 and 

for p{uj, w') = 1 as 



(Leb(P)-Leb(B^)). 



inf sup E„ / (fix) ~ f{x)fdP{x) 

> inf maxE„ / {f{x) - ^ {x)fp'^ {x)dx 

i\/r2(j \d-l 

> ^-^ (Leb(P) + Leb(P^))(l - Ae'^-^ Leb(B))" 



Af2 



1 + 2 



d-l 



(Leb(P) -Leb(P^)). 



1 + 2 



> 



2d-r 



\\pi:AP::\\>{i-d,iP'^,p-)/2r 

= (1 - Ae'^-iLcb(P))". 



For any uj ^ uj', denoting as w A the logical and of w — Ae"*^^ Lcb(P))" > ( 1 — 

and cj', we have, for arbitrary j e {1, Z}''"^, V Cq ? 



Clearly Leb(P) < i. Let cq > 3. Then e < 1/8 and 

A < (1 - 2e)-(''-i)(l - 4e - r)-^ < 2'^+\ so 



-2''/4-' 



M'^dx 



AM'^dx 



= ^ E 

= uj'){M'^ Leb(i?jAP-) + AM'^ Leb(Pj n Pj)) 
= 2p(a;, w')Af^(Leb(P-) + Leb(Pj n Pj)) 
= 2p(a;,w')A^^e''"^(Leb(P) +Leb(P^)) 

where we define — {x ^ B : x — {Q, 0, r) e P}. 
Then by Theorem |6] 

infmaxE^[d2(/-,/2)] 

a; cjGO 

> (Leb(P) + Lcb(P,))(l - Ae-^-i Leb(P))". 
So if we let Co > (2^^/ log(5/4))i/(''-i), then
'^0 > 4/5 and for sufficiently large n we will 
have (1 - Xe'^-^ Leb(P))" > 4/5. Hence, 

inf sup E„ / {fix) - f{x)fdP{x) 

7 {pJ)eVxYia) J 



> 



5 • 2'i-2 



(Leb(P,.)-2Leb(P\P^)). 



Since 



and 



Leb(P^) 



2 V2 



Leb(P\Pr) < r 



T{d/2 + 1) 



Jd~l)/2 



2'^-ir((d- l)/2 + l) 



where F is the gamma function, then 



which is smaller than er, so for n sufficiently large, 



Leh{Br) -2Lch{B\Br 



-2\2 V r(d/2 + l) ''2'^-2r((d- l)/2 + l) 



> 



TT 



(i/2 



Now let r be such that 



(l-2r)" 



Ad 



(it is easy to see that this can be satisfied by some r e 
(0, 1/4) for any d > 1). So we have 

inf sup E„ / {f{x) - f{x)fdP{x) 

f ip.f)eVxY{a) J 

> 



5.22d-ir(d±i)rf- 



Verifying condition number: 

Let t{A) be the condition number of a set A. Then for 
arbitrary to, 



t(5") = min<j r(S:"),T(S'"') i inf inf \\u - v\\2 

2 uGS" t,gs" 



Due to the shape of the function g, for arbitrary i G 

{1, l}"^^^ we have 

r(5:") > min {T{dS),T{dB:;\dS)} 



By definition of 5 it is easy to see that T{dS_) = e. Also 
rm:^^S) 

= T({(i, Xd) G [-e/2, e/2]'^-i x [0, 1] : = g{x/e)}) 
= er({(i, Xd) e [-1/2, 1/2]'^-! x [0, 1] : = g{x)}) 
— er. 



> 



2(coni/(d-i) +2) 



1 



1 

- + 

4 2 



1 



2(co + 1) 
which completes the proof. 



□ 



Proof of Theorem |3] First, we derive a general con- 
centration of £{f) around £{f) where £{f) ~ R{f) — 
Rin = U^, and U, = -{Y, - f{Xi)f + 

If the variables Ui satisfy the following moment condi- 
tion: 



E[|C7,-E[f// 



< 



var{Ui) 



kl' 



,k-2 



for some r > 0, then the Craig-Bernstein (CB) inequality 
(Craig 1933) states that with probability > 1 — S, 



n 



log(l/^) tvar{Ui[ 



4=1 



nt 



2(1 - c) 



for $0 < tr \le c < 1$. The moment conditions are satisfied
by bounded random variables as well as Gaussian
random variables (see e.g. Haupt and Nowak [2006]).

To apply this inequality, we first show that var(C/i) < 
4(M2 + a^)£{f) since Y, = f{Xi) + e, with e, 
A/'(0,cr2). ^jgQ^ assume that |/(a;)|, \f{x)\ < M, 
where Af > is a constant. 

variU^) < E[Uf] 

= E[{-{Y,-f(x,)r + (Y,-r{x,)m 

-£[(-(/* (X,)+e.-/(X0)2 + (£.)')'] 

= E[(-(r (X,) - /(X,))2 - 2e,(/*(X,) - fiX^W] 

< 4M^£if) + A(j'^£{f) ^ A{M^ + a^)£{f) 



Since r < 1, we have t(5") > re, and similarly 
t(5") > re. Now, 



o 1^1 ^^^^ \\u-'"h 

> J . , ll(w,gH)-(«,5(") + ^)ll2 

2 ti,i>G[-0.5,0.5]''-i 



Therefore using CB inequality we get, with probability 

> 1-5, 



£if)-£if)< 



\og{l/d) , i2(A/2+a2)£(/) 



nt 



(1-c) 



Now set c = tr = 8i(Af2 + a^)/15 and let t < 
15/(38(A/2 + CT^)). With this choice, c < 1 and define 

a= ^- ^ < 1. 

11 (1-^) 



Then, using a and rearranging terms, with probability > 



nt 



where t < 15/(38(M2 + a^)). 



Then, using the previous concentration result, and taking 
union bound over all f £ J^, we have with probability 
> 1-5, 



1 



^ 1^ 
Now consider 



iog(l-^l/^) 

nt 



< 



1 



< 



l-a 
1 

1^ 



nt 
nt 



Taking expectation with respect to validation dataset. 



1 



R{f) - R{n + 



logm/sy 

nt 



Now taking expectation with respect to training dataset. 



nt 

Since this holds for all / e J^, we get: 
1 



< 



l-a 
+46 M^. 



jdnEriSif)] 



nt 



The result follows since T = {f^ h}aeA,heH and \T\ 
\A\\U\. 



□ 